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In many fields, researchers are interested in large and complex bi- 
ological processes. Two important examples are gene expression and 
DNA methylation in genetics. One key problem is to identily aberrant 
patterns of these processes and discover biologically distinct groups. 
In this article we develop a model-based method for clustering such 
data. The basis of our method involves the construction of a like- 
lihood for any given partition of the subjects. We introduce cluster 
specific latent indicators that, along with some standard assumptions, 
impose a specific mixture distribution on each cluster. Estimation is 
carried out using the EM algorithm. The methods extend naturally to 
multiple data types of a similar nature, which leads to an integrated 
analysis over multiple data platforms, resulting in higher discrimi- 
nating power. 

1. Introduction. Epigenetics refers to the study of heritable character- 
istics not explained by changes in the DNA sequence. The most studied 
epigenetic alteration is cytosine (one of the four bases of DNA) methyla- 
tion, which involves the addition of a methyl group (a hydrocarbon group 
occurring in many organic compounds) to the cytosine. Cytosine methyla- 
tion plays a fundamental role in epigenetically controlling gene expression, 
and studies have shown that aberrant DNA methylation patterning occurs 
in inflammatory diseases, aging, and is a hallmark of cancer cells [Roden- 
hiser and Mann (2006); and Figueroa et al. (2010)]. Figueroa et al. (2010) 
performed the first large-scale DNA methylation profiling study in humans, 
where they hypothesized that DNA methylation is not randomly distributed 
in cancer but rather is organized into highly coordinated and well-defined 
patterns, which reflect distinct biological subtypes. Similar observations had 
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already been made for expression data [Golub et al. (1999); Armstrong et al. 
(2002)]. Identifying such biological subtypes through abnormal patterns is 
a very important task, as some of these malignancies are highly heteroge- 
neous, presenting major challenges for accurate clinical classification, risk 
stratification and targeted therapy. The discovery of aberrant patterns in 
subjects can identify tumors or disease subtypes and lead to a better under- 
standing of the underlying biological processes, which in turn can guide the 
design of more specifically targeted therapies. Due to the biological inter- 
action between methylation and expression, biologists hope to optimize the 
amount of biological information about cancer malignancies by borrowing 
strength across both platforms. As an example Figueroa et al. (2008) showed 
that the integration of gene expression and epigenetic platforms could be 
used to rescue genes that were biologically relevant but had been missed by 
the individual analyses of either platform separately. 

In this article we propose a model-based approach to clustering such high- 
dimensional microarray data. In particular, we build finite mixture models 
that guide the clustering. These types of models have been shown to be 
a principled statistical approach to practical issues that can come up in 
clustering [McLachlan and Basford (1988); Banfield and Raftery (1993); 
Cheeseman and Stutz (1995); Fraley and Raftery (1998, 2002)]. The moti- 
vating application is the cluster analysis of Figueroa et al. (2010), which 
focused on patients with Acute Myeloid Leukemia (AML). Both methyla- 
tion and expression data are available and we develop a clustering method 
that can be applied to both data types separately. Furthermore, we extend 
our methodology to facilitate an integrated cluster analysis of both data 
platforms simultaneously. Although the methods are designed for these par- 
ticular applications, we expect that they can be applied to other types of 
microarray data, such as ChlP-chip data. 

A lot of attention has been given to classification based on gene expres- 
sion profiles and more recently based on methylation profiles. Siegmund, 
Laird and Laird-Offringa (2004) give an overview and comparison of several 
clustering methods on DNA methylation data. They point out that among 
biologists, agglomerative hierarchical cluster analysis is popular. However, 
they argue in favor of model-based clustering methods over nonparametric 
approaches and propose a Bernoulli-lognormal model for the discovery of 
novel disease subgroups. This model had previously been applied by Ibrahim, 
Chen and Gray (2002) to identify differentially expressed genes and profiles 
that predict known disease classes. More recently, Houseman et al. (2008) 
proposed a Recursively Partitioned Mixture Model algorithm (RPMM) for 
clustering methylation data using beta mixture models [Ji et al. (2005)]. 
They proposed a beta mixture on the subjects and the objective was to clus- 
ter subjects based on posterior class membership probabilities. The RPMM 
approach is a model-based version of the HOPACH clustering algorithm 
developed in van der Laan and Pollard (2003). 
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In high-dimensional data clustering is often performed on a smaller sub- 
set of the variables. In fact, as pointed out in Tadesse, Sha and Vannucci 
(2005), using all variables in high-dimensional clustering analysis has proven 
to give misleading results. There is some literature on the problem of simul- 
taneous clustering and variable selection [Friedman and Meulman (2003); 
Tadesse, Sha and Vannucci (2005); Kim, Tadesse and Vannucci (2006)]. 
However, most statistical methods cluster the data only after a suitable 
subset has been chosen. An example of such practice is McLachlan, Bean 
and Peel (2002), where the selection of a subset involves choosing a signifi- 
cance threshold for the covariates. That is also essentially what Houseman 
et al. (2008) and Figueroa et al. (2010) did, but they selected a subset of 
the most variable DNA fragments. In this paper we present an integrated 
model-based hierarchical clustering algorithm that clusters samples based on 
multiple data types on the most variable features. There is of course a clear 
advantage of automated variable selection methods such as in Tadesse, Sha 
and Vannucci (2005). However, the implementation of such methods seems 
far from straightforward and due to the popularity of hierarchical algorithms 
among biologists [Kettenring (2006) concluded that hierarchical clustering 
was by far the most widely used form of clustering in the scientific litera- 
ture], there is a clear benefit in having a simple hierarchical algorithm that 
can handle multiple data types. 

The article is organized as follows. In Section 2 we describe the features of 
the motivating data set. In Section 3 we construct the model as a mixture of 
Gaussian densities, which leads to a specific mixture likelihood that serves 
as an objective function for clustering. We also introduce individual specific 
parameters to account for subject to subject variability within clusters (i.e., 
the array effect). In Section 4 we present two model-based clustering algo- 
rithms. The first algorithm is a hierarchical clustering algorithm that can 
be used to find a good candidate partition. The second clustering algorithm 
is an iterative algorithm that is designed to improve upon any initial parti- 
tion. The likelihood model can be applied to classification of new subjects 
and in Section 5 we describe a discriminant rule for this purpose. In Sec- 
tion 6 we extend the model to account for multiple data platforms and in 
Sections 7 and 8 we apply the methods to real data sets, which involve both 
methylation and expression data. We conclude the article with a discussion 
in Section 9. 

2. Motivating data. The Erasmus data were obtained from AML sam- 
ples collected at Erasmus University Medical Center (Rotterdam) between 
1990 and 2008 and involve DNA methylation and expression profiles of 
344 patient specimens. For each specimen it was confirmed that >90% of 
the cells were blasts (leukemic cells). Description of the sample processing 
can be found in Valk et al. (2004) and data sets are available in GEO, 
http://www.ncbi.nlm.nih.gov/geo/, with accession numbers GSE18700 
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for the methylation data and GSE6891 for the expression data. The gene ex- 
pression profiles of the AML samples were determined using oligonucleotide 
microarrays (Affymetrix U133Plus2.0 GeneChips) and were normalized us- 
ing the rma normalization method of Irizarry et al. (2003). The processed 
data involved 54,675 probe sets and demonstrated a right skewed distribu- 
tion of the expression profiles for each subject; see Supplementary Figure 
1 in the supplemental article [Kormaksson et al. (2012)]. The methylation 
profiles of the AML samples were determined using high density oligonu- 
cleotide genomic HELP arrays from NimbleGen Systems that cover 25,626 
probe sets at gene promoters, as well as at imprinted genes. Briefly, genomic 
DNA is isolated and digested by the enzymes Hpall and Mspl, separately. 
While Hpall is only able to cut the DNA at its unmethylated recognition 
motif (genomic sequence 5'-CCGG-3'), Mspl cuts the DNA at any Hpall 
site whether methylated or unmethylated. Following PGR, the HpaH and 
Mspl digestion products are labeled with different fluorophores and then co- 
hybridized on the microarray. This results in two average signal intensities 
that measure the relative abundances (in a population of cells) of Mspl and 
HpaH at each probe set. The data are preprocessed and normalized using 
the analytical pipeline of Thompson et al. (2008) and the final quantity of 
interest is log(Hpan/MspI). Note that although theoretically HpaH should 
always be less than Mspl, complex technical aspects that arise during the 
preparation and hybridization of these samples may result in an enrichment 
of the HpaH signal over that of MspL Therefore, the log-ratio does not have 
a one-to-one correspondence with percent methylation at a given probe set 
but rather provides a relative methylation value that correlates with actual 
percentage value (see lower left panel of Figure 1). 

In what follows, notation will be based on the HELP methylation data. 
However, if we abstract away from this particular application, the terminol- 
ogy can be adapted to other microarray data such as gene expression. Let yij 
denote the continuous response log(Hpanjj- /Msplj^ ) for subject i = 1, . . . , n, 
and probe set j = 1, . . . , G. Lower values of yij indicate that probe set j has 
high levels of methylation (in a population of cells) for subject i, whereas 
higher values indicate low levels of methylation. In the upper panel of Fig- 
ure 1 we see bimodal histograms of the methylation profiles for two pa- 
tients in the AML data set along with two component Gaussian mixture 
fits. In Supplementary Figure 2 of the supplemental article [Kormaksson 
et al. (2012)] we see density profiles for all 344 samples stratified by clus- 
ters. There is a large microarray effect in the methylation data, but we 
observe that all profiles are either skewed or exhibit a bimodal behavior. 
The lower left panel of Figure 1 shows how the HELP assay correlates with 
methylation percentages obtained using the more accurate, but much more 
expensive, quantitative single locus DNA methylation validation MASS Ar- 
ray [see Figueroa et al. (2010)]. It is clear that the HELP values are forming 
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Fig. 1. Upper panel: A histogram of the log signal ratio, log(HpaII/MspI), for patients 
number 41 and 42 in the Erasmus data set, along with two component Gaussian mix- 
ture fits. Lower panel: Left graph shows HELP methylation values plotted against more 
accurate MASS array methylation percentages. Right graph shows that the posterior prob- 
abilities from a two component Gaussian mixture classifies probe sets well into low and 
high methylation. 

two clusters of relatively low or high methylation levels with some noise 
in the percentage range [20%, 80%]. This apparent dichotomization inspires 
modeling each individual profile, yj = {yn, ■ ■ ■ ^yic)' ^ with a two component 
mixture distribution and normality is assumed for each component due to 
its flexibility and ease of implementation. We know of no biological mecha- 
nism that would imply normality, however, the assumption gives consistent 
and reasonable fits of the individual methylation profiles (see upper panel 
of Figure 1). 

3. Model specification. By dichotomizing the methylation process we 
can cluster the probe sets into high or low methylation for each patient i 
by applying a two component Gaussian mixture model. Let C denote the 
true partition of the subject set, [n] = {l,...,n}. We assume that on any 
given probe set j, all subjects sharing a cluster c G C have the same relative 
methylation status (high or low), and introduce for each cluster c a single 
latent indicator vector, Wc = {wd, ■ ■ ■ , Wcg)' ■, with 

_ if j has high methylation for all subjects in cluster c, 

I 0, if j has low methylation for all subjects in cluster c. 

It is well known that methylation does exhibit biological variability from 
individual to individual. However, it is biologically reasonable to expect 
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consistency in relative methylation patterns for patients tliat share the same 
disease subtype. Define 6i = (//ij, cr^j, /X2j, alJ' and assume that the observed 
data, y = {y'l, . . . ,y^)', given the unobserved methylation indicators, w = 
(w^, . . . , w^)' (K being the number of clusters), arise from the following 
density: 

(2) f{y\^,o) = llllf{yi\w,,e. 

with 

(3) /(yilwc^i) = Y[<P{yij\f^ii^(^u)'""'4>{yij\fJ'2i,o-' 



G 

2 \1—Wn 



where (j) denotes the normal density and = {O'l^ . . . ,0^)'. We refer to the 
density in (2) as the classification likelihood of the observed data [Scott and 
Symons (1971); Symons (1981); Banfield and Raftery (1993)] and assume 
A^ii < /^2i for all i. We interpret Oi as the individual specific means and 
variances of the high and low methylation probe sets, respectively. Note that 
this setup is different from the usual model-based clustering setup where we 
have cluster specific parameters only. However, due to array effects, it is 
reasonable and in fact necessary to require different parameters for different 
subjects. In the upper panel of Figure 1 we see histograms and fits for 
two patients that both have chromosomal inversions at chromosome 16, 
inv(16) (inversions refer to when genetic material from a chromosome breaks 
apart and then, during the repair process, it is re-inserted back but the 
genetic sequence is inverted from its original sense). These two patients 
cluster together under various clustering algorithms, including the model- 
based algorithm presented below. However, the two distributions are clearly 
not equal. 

We put a Bernoulli prior on the latent methylation indicators in (1): 

G 

(4) /(w) = n n <c^^' , ^oc + vrie = 1, 

c6Cj=l 

where ttic and ttqc denote the proportions of probe sets that have high and 
low methylation, respectively, in cluster c. From (2) and (4) it is clear that 
the complete data density is 

( VrocfJ(?!'(yij|/i2i,0"2i) ) 

and if we integrate out the latent variable w, we arrive at the marginal 
likelihood 

ijlf^li^'^li) + TTOc JJ<?^(yij|/^2i,0"|j) j , 
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where * = {{TTic)ceC: (/^li; '^lij A*2i, cii)*} denotes the set of parameters. This 
hkehhood can be used as an objective function for determining the goodness 
of different partitions and the maximization of (6) is carried out with the EM 
algorithm of Dempster, Laird and Rubin (1977). Note that Lq can be written 
as a product, Hcec-^c, where Lc denotes the hkehhood contribution of clus- 
ter c. Thus, maximizing Lq can be achieved by maximizing Lc independently 
for all c € C. Details of the maximization algorithm are provided in Supple- 
mentary Appendix B of the supplemental article [Kormaksson et al. (2012)]. 

Remark 1. The premise of the clustering algorithm presented in Sec- 
tion 4 is to cluster subjects together that have similar methylation patterns. 
Similarities across the genome in the posterior probabilities of high/low 
methylation guide which subjects are clustered together and, thus, if the 
posterior probability predictions reflect the data well, the clustering algo- 
rithm should perform well. In the lower right panel of Figure 1 we see that 
the posterior probabilities of high methylation fit very well with the actual 
percentage values. 

Remark 2. When we allow for unequal variances af^ ^ Ugj, the likeli- 
hood in (6) is unbounded and does not have a global maximum. This can be 
seen by setting one of the means equal to one of the data points, say, fiu = 
yij, for some Then the likelihood approaches infinity as cr^j — >■ 0+. How- 
ever, McLachlan and Peel (2000), using the results of Kiefer (1978), point out 
that, even though the likelihood is unbounded, there still exists a consistent 
and asymptotically efficient local maximizer in the interior of the parameter 
space. They recommend running the EM algorithm from several different 
starting values, dismissing any spurious solution (on its way to infinity), 
and picking the parameter values that lead to the largest likelihood value. 

Remark 3. Note that the likelihood in (6) is identifiable except for the 
standard and unavoidable label switching problem in finite mixture models 
[see, e.g., McLachlan and Peel (2000)]. Furthermore, there exists a sequence 
of consistent local maximizers, as G ^ oo. This becomes more evident if 
one recognizes that the expression for a single cluster c can be written as 
a multivariate normal mixture 
G 

Lc = W{T^lc4'{ycj \^llc, Sic) + 7roc'/'(ycj |/^2c) ^Ic)), 

i=i 

where ycj = {vij, • • • , ?/n,j)', fJ-kc = {fJ-ki, • • • , A'fcnJ' and = diag(<T^-)"^i, 
k = 1,2 (assuming for convenience that i = 1, . . . ,nc are the members of 
cluster c). Standard theory thus applies [see McLachlan and Peel (2000)]. 

Remark 4. The correlation structure of high-dimensional microarray 
data is complicated and hard to model. Thus, we assume independence 
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across variables in the likelihood (6) even though it may not be the ab- 
solutely correct model. However, we can view (6) as a composite likelihood 
[see Lindsay (1988)] which yields consistent parameter estimates but with 
a potential loss of efficiency. The correlations observed in microarray data 
are usually mild and involve only a few and relatively small groups of genes 
that have moderate or high within-group correlations. In Supplementary 
Appendix A of the supplemental article [Kormaksson et al. (2012)] we per- 
form a simulation study to get a sense of how robust our algorithm is to this 
independence assumption. The results indicate that with a sparse overall 
correlation structure in which genes tend to group into many small clusters 
with moderate to high within-group correlations, our algorithm is not af- 
fected by assuming independence across variables. However, there is some 
indication that with larger groups of genes with very extreme within-group 
correlations the algorithm will break down. In microarray data such extreme 
correlation structures are not to be expected on a global scale and, therefore, 
we believe that the independence assumption is quite reasonable. 

4. Model-based clustering. Our clustering criterion involves finding the 
partition that gives the highest maximized likelihood Lq as given in (6). 
This provides us with a model selector, as we can compare the maximized 
likelihoods of any two candidate partitions. In theory we would like to maxi- 
mize Lq with respect to all possible partitions of [n] and simply pick the one 
resulting in the highest likelihood. However, as this is impossible for even 
moderately large n, we propose two clustering algorithms. In Section 4.1 
we propose a simple hierarchical clustering algorithm, and in Section 4.2 we 
propose an iterative algorithm that is designed to improve upon any initial 
partition. 

4.1. Hierarchical clustering algorithm. In this subsection we describe 
a simple hierarchical algorithm that attempts to find the partition that max- 
imizes Lc as given in (6). Heard, Holmes and Stephens (2006) used a similar 
approach, but they constructed a hierarchical Bayesian clustering algorithm 
that seeks the clustering leading to the maximum marginal posterior prob- 
ability. The algorithm can be summarized in the following steps: 

1. We start with the partition where each subject represents its own clus- 
ter, Ci = {{1}, . . . , {^i-}}, and calculate the maximized likelihood, Lq^. Note 
that this likelihood can be written as a product Lc^ = L|j} and, thus, the 
first step involves maximizing Ljj} for each z = 1, . . . , n. This is achieved by 
fitting a two-component Gaussian mixture to each of the n individual pro- 
files. As mentioned in Remark 2, each fit can be obtained by using the EM 
algorithm starting from several different initial values and finding a local 
maximum. It is important that the user verifies these initial individual fits 
before proceeding with the hierarchical algorithm. For example, by going 
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through the 344 methylation profile fits of the Erasmus data, one by one, 
we observe pleasing fits. The upper panel of Figure 1 gives examples of two 
such profile fits. 

2. Next we merge the two subjects that leads to the highest value of Lc 
and denote the maximized likelihood value by Lq^- Note that there are (2) 
many ways of merging two subjects at this step. However, since we already 
obtained fits for i = 1, . . . ,n, at Step 1, we only need to maximize 
L|j j/j, for all pairs and find the pair that maximizes 

^C2 = ^Ci - (%} + ^{i'}) + 

where i denotes the loglikelihood. Even though we are applying several EM 
algorithms, the complexity of each algorithm is low since it only involves 
two subjects at a time. 

3. We continue merging clusters under this maximum likelihood criteria, 
at each step making note of the maximized likelihood, until we are left with 
one cluster containing all n subjects, Cn = [n]. Among the n partitions that 
are obtained, we pick the partition that has the highest value of Lc- Note 
that the likelihood value may either increase or decrease at each step. This 
provides us with a method that automatically determines the number of 
clusters. 

It is our experience that the individual parameter estimates (/ii,, afj, /U2i; (^2i)i 
do not change much at each merging step of the hierarchical algorithm. Thus, 
if the initial estimates provide good fits for all the individual profiles, the 
algorithm can be expected to perform well. Furthermore, by using the indi- 
vidual parameter estimates at a previous merging step as initial values at 
the next step, each EM algorithm converges very quickly, which is essen- 
tial since the total number of EM algorithms that are conducted is of the 
order 0(?i^). For the data sets that we consider in this article, the hierar- 
chical algorithm takes anywhere from a couple of minutes to run, for the 
smallest data set in Section 8.1 (n = 14), up to a couple of hours for the 
Erasmus high-dimensional data set of Section 7.1 {n = 344), using a regular 
laptop. However, it should be noted that our R code is neither optimized 
nor precompiled to a lower level programming language at this stage. 

4.2. Iterative clustering algorithm. The hierarchical algorithm results in 
a partition that serves as a good initial candidate for the true partition. In 
this subsection we present an iterative algorithm that is designed to improve 
upon any initial partition. We introduce cluster membership indicators for 
the subjects in order to develop an EM algorithm for clustering subjects 
under the assumption of a fixed number of clusters. Define for each subject 
i = 1, . . . , n and cluster c € C 

^ _ f 1, if subject i is in cluster c, 
^'^ \0, otherwise. 
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and let Xj = {Xic)c£C- Assume Xi, . . . , X„ are i.i.d. Multinomjl, p = (pc)cec}) 
so the density of X = (X.[, . . . , X^)' is 

n 

(7) /(x)=nn^'^' T.p<^=^- 

i=l c&C ceC 

These cluster membership indicators fully define the partition C and we note 
that the classification likelihood in (2) can be written as 

n 

(8) /(yix)=nn^(y*i^^'^^)'''^- 

c£Ci=l 

Multiplying (7) and (8) together and integrating out X, we arrive at the 
marginal likelihood 

n 

(9) /(y;*) = nE^'-/(y*l^-^^)' 

i=l c£C 

where * = {(pc)cec, (wc)cGC,^i = (/^li, c^fj, ;^2i, o-ii)*} involves both the con- 
tinuous parameters and the discrete indicators, w, which we now assume are 
fixed. We make this assumption because if w is random as in (4), the joint 
posterior distribution of (w,X) is highly intractable and an EM algorithm 
based on (8) would be problematic. 

The likelihood in (9) is that of a finite mixture model and can be max- 
imized using the EM algorithm. We detail the maximization procedure in 
Supplementary Appendix B of the supplemental article [Kormaksson et al. 
(2012)]. In short, let X^''^ denote the clustering labels corresponding to a can- 
didate partition. Using X^''^ as an initial partition, we run an EM algorithm 
that converges to a local maximum of (9). Once the mixture model has been 
fitted, a probabilistic clustering of the subjects can be obtained through 
the fitted posterior expectations of cluster membership for the subjects, 
{E[Xic\y])i^c [see McLachlan and Peel (2000)]. Essentially, a subject will be 
assigned to the cluster to which it has the highest estimated posterior proba- 
bility of belonging. We have found empirically that the derived partition not 
only results in a higher value of (9) but also in the objective likelihood (6), 
but we do not have a theoretical justification for this. A good clustering strat- 
egy is to come up with a few candidate partitions, with varying numbers of 
clusters, and run the EM algorithm using these partitions as initial parti- 
tions. Each resulting partition will be a local maximum of (9), but we choose 
the partition with the highest value of the original objective function (6). 
Good initial partitions can be found by running the hierarchical algorithm 
of Section 4.1 or applying one of the more standard clustering algorithms. 

5. Classification. The construction of a likelihood for any given partition 
of the subjects also provides a powerful tool for classification. Assume we 
have methylation data on n subjects and we know which class each subject 
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belongs to, that is, we know the true C. A by-product of maximizing the 
Ukehhood in (6) with the EM algorithm [detailed in Supplementary Ap- 
pendix B of the supplemental article, Kormaksson et al. (2012)] is posterior 
predictions of the latent indicators, (wc)ceC5 which we round to either or 1. 
Given these estimated methylation indicators, the conditional likelihood of 
a new observation iyij)j, on the assumption that i G c, is given by 

G 

(10) LM = ]lHy^Mi^<^^if''Hy^Mu^2i)'~''^'■ 

The discriminant likelihood, Lc, is maximized with respect to the individual 
specific parameters at 

. _ Ej wcjVij .2 _ T^jWcjivij- hi? 

Ej Wcj 22 j Wcj 

E,(i-^c,) ' E,(i-^c,) • 

By substituting these estimates into (10) we arrive at the following discrim- 
inant rule: 

(11) i£c if Lc(0i(we)) > L,,{0,{^,,)) for all c' / c. 

6. Extension to multiple platforms. In this section we discuss how to 
extend the methods of this paper to account for multiple data types as long 
as each data type can reasonably be modeled by the model described in 
Section 3. For subject i = 1, . . . ,n let Uijk denote the signal response of the 
jth variable, j = 1, . . . ,Gk, on platform k = 1, ... ,171. As before, we let C 
denote the true partition of the n subjects. We assume subjects in a given 
cluster c G C have identical activity (methylation, expression, etc.) profiles on 
each platform k = 1, ... ,171 independently and define a cluster and platform 
specific indicator for each variable 

_ J 1, if variable j on platform k is active in cluster c, 
'^■^^ I 0, if variable j on platform k is inactive in cluster c. 

Define Wc = {wcjk)j,k and let yj = {yijkjj.k denote the vector of observed ac- 
tivity profiles of subject i across platforms. Let 6i = (fJ'iik, '^fik^ f^2ik, cr2ik)T=i 
denote the subject specific mixture parameters. We assume that the ob- 
served data, y = (y^, . . . , y^)"^, given the unobserved activity indicators, 
w = (wc)cgC) arise from the following density: 

fiy\w,e) = llY[fiy,\w,,e,), 
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where the conditional density of y^, on the assumption that i S c, is given 
by 

m 

k=lj=l 

We can either assume that the activity indicators for cluster c are fixed 
as in subection 4.2, or independent Bernoullis, both across platforms and 
variables, 

fM = n n <cf ^lt'' ' ^icfe + ^ock = 1, 
fc=ij=i 

where nick represents the proportions of variables on platform k that are 
active in cluster c. The likelihood in this integrated framework is identical 
to the one given in (6), except we now have an additional product across 
platforms k. The methods presented in Sections 4 and 5 thus extend to 
multiple platforms in a straightforward manner. 

7. Identifying subtypes of AML. Figueroa et al. (2010) performed the 
first large-scale DNA methylation profiling study in humans using the Eras- 
mus data described in Section 2. They clustered patients using hierarchical 
correlation based clustering on a subset of the most variable probe sets. 
Using unsupervised clustering, they were able to classify the patients into 
known and well-characterized subtypes as well as discover novel clusters. In 
Section 7.1 we report our clustering results on the data and compare to those 
of Figueroa et al. (2010). We ran a cluster analysis on both methylation and 
expression data separately as well as an integrative cluster analysis on both 
platforms simultaneously. In Section 7.2 we present results from a discrim- 
inant analysis study in which we classified an independent validation data 
set using the methods of Section 5. 

7.1. Clustering results. Figueroa et al. (2010) hierarchically clustered the 
n = 344 patients (methylation profiles only) on a subset of the 3745 most 
variable probe sets, using 1-correlation distance and Ward's agglomeration 
method. These were probe sets that exceeded a standard deviation threshold 
of 1. We ran the hierarchical algorithm of Section 4.1 on the same subset 
to obtain an initial partition. Among the 344 candidate partitions, obtained 
at each merging step, the loglikelihood was maximized at K = 17 clusters, 
but to avoid singletons we chose a partition with K = 16, the same number 
of clusters Figueroa et al. (2010) chose. We then applied the iterative algo- 
rithm of Section 4.2 in an attempt to improve upon the initial partition. We 
denote the resulting partition "M" (Methylation). We repeated this process 
separately for the expression data using the 3370 most variable probe sets, or 
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those that exceeded a standard deviation threshold of 0.5. This resulted in 
a partition "E" (Expression) with K = 17 clusters. Finally, we repeated this 
process jointly on the 3745 and 3370 probe sets from the methylation and 
expression data, respectively, resulting in the partition "ME" with K = 14 
clusters. 

Figueroa et al. (2010) identified 3 robust and well-characterized biological 
clusters and 8 clusters that were associated with specific genetic or epige- 
netic lesions. Five clusters seemed to share no known biological features. The 
three robust clusters corresponded to cases with inversions on chromosome 
16, inv(16), and translocations between chromosomes 15 and 17, t(15;17), 
and chromosomes 8 and 21, t(8;21) (translocations refer to when genetic 
material from two different chromosomes breaks apart and when being re- 
paired, the material from one chromosome is incorrectly attached to the 
other chromosome instead and vice versa). The World Health Organization 
has identified these subtypes of AML as indicative of favorable clinical prog- 
nosis [see, e.g., Figueroa et al. (2010)]. The remaining 8 clusters included 
patients with CEBPA double mutations (two different abnormal changes in 
the genetic code of the CEBPA gene) , CEBPA mutations irrespective of type 
of mutation, silenced CEBPA (abnormal loss of expression of CEBPA which 
is not due to mutations in the genetic code), one cluster enriched for llq23 
abnormalities (any type of change in the genetic code that affects position 23 
of the long arm of chromosome 11) and FAB M5 morphology (specific shape 
and general aspect of the leukemic cell as defined by the French American 
British classification system for Acute Leukemias), and four clusters with 
NPMl mutations (mutations in the genetic code of the NPMl gene). A de- 
tailed sensitivity and specificity analysis of 6 of the above 11 clusters [sam- 
ple sizes in brackets], inv(16) [ni = 28], t(15; 17) [na = 10], t(8;21) [ng = 24], 
CEBPA double mutations [n4 = 24], CEBPA silenced AMLs [n^ = 8], and 
11(723 -|- FAB M5 [ng = 7], is given in Table 1 for the different clustering 
results. We include the correlation based clustering result (COR) on the 
methylation data to compare with "M." The remaining five of the 11 bio- 
logical clusters (CEBPA mutations irrespective of type of mutation and the 
four NPMl mutation clusters) had sensitivity or specificity below 0.5 for all 
four clustering results and were thus excluded from the table. We can see 
that the model-based approach, "M," is doing better than the correlation 
based method, "COR," for the most part. Aside for sensitivity of t(8; 21) 
(1 less false negative) and specificity of inv(16) (1 less false positive), the 
model-based approach has as good or better sensitivity and specificity. The 
most striking differences are in the numbers of false negatives of CEBPA 
dm and false positives of t(8; 21) where "M" is doing better. Note also that 
aside for the sensitivity of llg23 -|- FAB M5 and specificity of t(8;21), the 
integrated analysis "ME" always does better than the analyses "M" and 
"E" separately, with perfect sensitivity and specificity for many of the clus- 
ters. Most notably, the integrative analysis is able to perfectly classify the 



14 



KORMAKSSON, BOOTH, FIGUEROA AND MELNICK 



Table 1 

The sensitivity and specificity of the clustering results 



Subtype 




COR 


M 


E 


IVIH; 




Sensitivity of false neg 


;atives in parentheses) 




inv(16) [ni = 28] 




0.929 (2) 


0.964 (1) 


0.857 (4) 


0.964 (1) 


t(15; 17) [na = 10] 




0.800 (2) 


0.800 (2) 


1.000 (0) 


1.000 (0) 


t(8;21) [713 = 24] 




0.917 (2) 


0.875 (3) 


0.917 (2) 


0.958 (1) 


CEBPA dm [774 = 


= 24] 


0.792 (5) 


0.917 (2) 


0.75 (6) 


1.000 (0) 


CEBPA Sil [ris = 


8] 


0.625 (3) 


0.875 (1) 


1.000 (0) 


1.000 (0) 


Ilg23 + FAB M5 


[716 = 7] 


0.857 (1) 


0.857 (1) 


0.714 (2) 


0.714 (2) 




Specificity (# of false positives in parentheses) 




inv(16) [?i — ?ii = 


316] 


1.000 (0) 


0.997 (1) 


0.997 (1) 


1.000 (0) 


t(15;17) [71-712 = 


= 334] 


1.000 (0) 


1.000 (0) 


1.000 (0) 


1.000 (0) 


i(8;21) [71-713 = 


320] 


0.972 (9) 


0.994 (2) 


1.000 (0) 


0.991 (3) 


CEBPA dm [71 - 


714 = 320] 


0.988 (4) 


0.988 (4) 


0.991 (3) 


1.000 (0) 


CEBPA Sil [77 - 715 = 336] 


1.000 (1) 


0.997 (1) 


0.985 (5) 


0.997 (1) 


Ilg23 + FAB M5 


[n — 776 = 337] 


0.991 (3) 


0.991 (3) 


0.991 (3) 


0.994 (2) 



CEBPA double mutants even though both "Ivl" and "E" have quite a few 
false positives and false negatives. This demonstrates the increased power to 
identify clusters by sharing information across multiple platforms. As a side 
product from our clustering algorithm, we obtain posterior probabilities of 
high methylation/expression, £'[t(;cj |y], which can be used to order genes in 
heatmaps to discover subtype specific methylation/expression patterns. In 
Figure 2 we see heatmaps of the two data sets used for the integrative cluster- 
ing, "]VIE," after rows have been ordered by increasing posterior probabilities 
(one cluster at a time). Such heatmaps are useful for graphically displaying 
the distinct methylation/expression patterns that characterize the different 
subtypes of cancer. 

7.2. Classification results. A second cohort of patients with AIVlL was 
available with which we could test the performance of the classification 
method of Section 5. This second cohort of n = 383 cases consisted of sam- 
ples, obtained from patients enrolled in a clinical trial from the Eastern Co- 
operative Oncology Group (ECOG) (Data are available at http:/ /www.ncbi. 
nlm.nih.gov/geo/, accession number pending). These patients were similar 
in characteristics to the Erasmus cohort, with only one exception, all pa- 
tients were younger than 60 years of age. Samples were processed in the 
same way as the Erasmus cohort, and their methylation was used to blindly 
predict their molecular diagnosis. Using the 3745 most variable probe sets 
and the clustering result "]VI" of the previous section, we fit the model (6) 
on the Erasmus cohort with the ElVI algorithm. By using the posterior pre- 
dictions of the methylation indicators, we applied the discriminant rule (11) 
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Fig. 2. Heatmaps of the methylation and gene expression values for the 6 well- charac- 
terized clusters obtained from the integrative clustering ("ME"). Columns correspond to 
patient samples and rows correspond to genes. On the left heatmap "yellow," "gray" and 
"blue" represent low, intermediate and high methylation, respectively. On the right heatmap 
"green, " "black" and "red" represent low, intermediate and high expression, respectively. 
Rows are ordered by increasing posterior probabilities of high methylation/ expression, 
E[wcj\y], first for inv{lQ), then for t(8;21), t(15;17), CEBPdm, CEBPsil, and finally 
for Ilq23-I-M5. Rows with equal probabilities for all 6 clusters (either equal to 1 or Q) are 
excluded to emphasize differences. 



on each patient in the ECOG data set. Since CEBPA and NPMl mutation 
status have not yet been made available for this cohort, only the perfor- 
mance for the prediction of the inv(16), t(8;21), CEBPA silenced, t(15;17) 
and llg23 + FAB M5 clusters could be tested. Inv(16) cases were predicted 
with 100% sensitivity and specificity. The predicted t(8; 21) cluster con- 
tained 100% of cases positive for this abnormality, and only one t(8; 21) 
case was misclassified to another cluster. Two cases, which had previously 
been unrecognized as CEBPA silenced AMLs, were predicted by the classi- 
fication method. One of them was later confirmed to indeed correspond to 
this molecular subtype by an alternative methylation measurement method. 
Similarly, one case was believed to have been misclassified as t(15; 17) since 
there were no molecular data confirming the presence of the PML-RARA 
gene fusion (the abnormal combination of the PML and RARA genes) re- 
sulting from this translocation. However, it was later confirmed that both 
the morphology and the immune diagnosis corresponded to that of an acute 
promyelocytic leukemia with t(15;17). Finally, the 11^23 + FAB M5 clus- 
ter included 9 of the 14 patients in the cohort that met these two criteria. 
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Table 2 

The sensitivity and specificity of the classification result. 
False negatives and false positives are in parentheses 



Subtype 


Sensitivity 


Specificity 


inv(16) [?i = 32] 


1.000 (0) 


1.000 (0) 


t(15;17) [n=l] 


1.000 (0) 


1.000 (0) 


t(8;21) [n = 28] 


0.964 (1) 


1.000 (0) 


CEBPA Sil [n = 1] 


1.000 (0) 


0.997 (1) 


11(?23 + FAB M5 [n = 14] 


0.643 (5) 


0.986 (5) 



There were also 5 false positives, 3 of them were M5 cases but did not have 
llg23 abnormalities, 1 of them harbored the llg23 abnormality but corre- 
sponded to an Ml, and the remaining case corresponded to an M4 case with 
a hyperdiploid karyotype. A summary of these results is provided in Table 2. 

8. Other applications. The clustering method presented in this paper 
is not restricted to the microarray platforms that the AML samples were 
processed on. In this section we demonstrate the versatility of our method 
by applying it to other microarray platforms and show that our algorithm 
does well in clustering subjects. We also provide a comparison with other 
existing methods for clustering microarray data. 

8.1. Expression in endometrial cancer. In this subsection we analyze the 
microarray expression data set in Tadesse, Sha and Vannucci (2005). En- 
dometrioid endometrial adenocarcinoma is a gynecologic malignancy typ- 
ically occurring in postmenopausal women. Identifying distinct subtypes 
based on common patterns of gene expression is an important problem, as 
different clinicopathologic groups may respond differently to therapy. Such 
subclassification may lead to discoveries of important biomarkers that could 
become targets for therapeutic intervention and improved diagnosis. High 
density microarrays (Affymetrix IIu6800 chips) were used to study expres- 
sion of 4 normal and 10 endometrioid adenocarcinomas on 7070 probe sets. 
Probe sets with at least one unreliable reading (limits of reliable detection 
were set to 20 and 16,000) were removed from the analysis, which resulted in 
G = 762 variables. Finally, the data were log-transformed, however, unlike 
Tadesse, Sha and Vannucci (2005), we chose not to rescale the variables by 
their range. More details about the data set can be found in Tadesse, Ibrahim 
and Mutter (2003) and is publicly available at http://endometrium.org. 

We hierarchically clustered the samples using the 300 most variable probe 
sets and plotted the results in Figure 3. We successfully separated the four 
normal tissues from the endometrial cancer tissues and the log-likelihood 
plot suggests that K = 2. The dendrogram consistently separated the nor- 
mals and the cancer into two branches for different variance thresholds. 
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Fig. 3. Left panel: Cluster dendrogram for the 2 normal proliferative ("P") endometria, 
2 normal secretory ("S") endometria, and 10 endometrioid adenocarcinomas ("T"). Right 
panel: Plot of the number of clusters against log-likelihood. 

However, if we lowered the threshold too much, the loglikelihood was maxi- 
mized at K = 1. This makes sense, as including many low variability probe 
sets might mask the true clustering structure. For comparison, Tadesse, Sha 
and Vannucci (2005) concluded that K = 3 and commented that there could 
possibly be 2 subtypes of endometrial cancer. However, to the best of our 
knowledge, this subclustering has not been verified. They also reported clus- 
tering results using the COSA algorithm of Friedman and Meulman (2003), 
which, like our analysis, seemed more suggestive of K = 2. 

8.2. Methylation in normal tissues. Houseman et al. (2008) used the 
RPMM algorithm to cluster a methylation data set consisting of 217 nor- 
mal tissues and compared the performance to that of the HOPACH al- 
gorithm of van der Laan and Pollard (2003). The RPMM analysis was 
discussed in more detail in Christensen et al. (2009) and the data made 
publicly available at the GEO depository with accession number GSE19434 
(http://www.ncbi.nlm.nih.gov/geo/). Briefly, DNA was extracted from 
the tissues, modified with sodium bisulfite, and processed on the Illumina 
GoldenGate methylation platform. Average fiuorescence for methylated (M) 
and unmethylated {U) alleles were derived from raw data at 1505 loci. How- 
ever, in this study 1413 loci passed the quality assurance procedures. A total 
of 11 tissue types were available, bladder (n = 5), adult blood (n = 30), in- 
fant blood (n = 55), brain (n = 12), cervix (n = 3), head and neck (n = 11), 
kidney (n = 6), lung {n = 53), placenta (n = 19), pleura (n = 18) and small 
intestine (n = 5). Houseman et al. (2008) constructed an average "beta" 
value from raw data, which they claimed was very close to the quantity 
MjiM -|- [/) € (0,1), rendering a beta distributional assumption (assum- 
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Table 3 

Cross-classification of our clustering result (K = 8) and tissue type. By merging the top 
two clusters, and clusters 3 and 4, we obtain the clustering corresponding to K — 6 



Class Blad Bl Br Cerv Infbl HN Kid Lung Plac Pleu Sm int 



15 2 1 53 18 4 

2 1 

3 6 

4 10 1 

5 30 

6 12 

7 55 

8 19 



ing M and U follow a gamma distribution) with class and locus specific 
parameters. We, however, chose to work with the quantity log{M/U), which 
fits better with our two component mixture distributional assumption. In or- 
der to get a direct comparison with RPMM and HOPACH, we used all 1413 
loci. In Supplementary Figure 3 of the supplemental article [Kormaksson 
et al. (2012)] we see a plot of cluster number versus loglikelihood, which is 
maximized at K = 6 clusters. However, given the relatively small difference 
between the loglikelihood values at K = 6, K = 7, and K = 8, one could ar- 
gue that all three clustering results are worthy of consideration. The clusters 
are cross-classified with tissue type in Table 3. 

If we compare these results with those obtained with the RPMM algo- 
rithm, our result favors few and concise clusters, whereas RPMM is in- 
dicative of a total of 23 subclasses of tissues. The HOPACH clustering 
algorithm was suggestive of K = 9 clusters, with 3 of those clusters rep- 
resenting placenta singletons separated from the main placenta cluster. We 
present a cross-classification table for both RPMM and HOPACH in Supple- 
mentary Table 3 of the supplemental article [Kormaksson et al. (2012)] for 
comparison [borrowed from Houseman et al. (2008)]. Our method perfectly 
classifies blood, brain, infant blood, kidney and placenta. For comparison, 
after bundling subclusters together, RPMM classifies blood and infant blood 
perfectly, and HOPACH classifies infant blood and placenta perfectly. All 
three methods have problems distinguishing between bladder, cervical, lung, 
pleural and small intestine tissues. Overall, our approach seems to outper- 
form HOPACH, and although Houseman et al. (2008) have demonstrated 
that a few of their tissue-specific subclusters (obtained by RPMM) have ver- 
ifiable meanings, such as through age difference, it seems that without a fur- 
ther justification of such finer substructure in the data our clustering result 
is more desirable. As a side note, under the assumption of Houseman et al. 
(2008) that M and U follow a gamma distribution, it is clear that log{M/U) 
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will not be a mixture of two Gaussian distributions. The favorable cluster- 
ing result for this data set suggests that the normality assumption on each 
mixture component provides a robust and flexible modeling distribution. 

9. Discussion. We have proposed a model-based method for clustering 
microarray data. The methods have been demonstrated to work well on ex- 
pression data and methylation data separately. An integrated cluster anal- 
ysis has further shown the power of combining platforms in a joint analysis. 
We believe this method can be applied to a variety of microarray data types. 
However, further research is needed to validate the method on different types 
of data such as ChlP-chip data. 

A minor drawback of our method is that it does not allow for automated 
selection of variables, but rather relies on pre-filtering the data. However, 
most biologists are still relying on simple clustering algorithms such as K- 
means or standard agglomerative algorithms, due to their simplicity in im- 
plementation and interpretation. Thus, having a relatively simple and easily 
implemented hierarchical algorithm that can integrate multiple platforms 
and further utilizes the bimodal or skewed structure of the individual pro- 
files, in a model-based manner, has its advantages. For example, the hi- 
erarchical algorithm automatically determines the numbers of clusters and 
provides an easily interpretable dendrogram. Also, as a side product, we 
obtain posterior probabilities of high methylation/expression, £'[t(;cj|y], for 
each cluster c and probe set j. By ordering the probe sets with respect 
to these posterior probabilities and excluding probe sets that are identical 
across all clusters, we can explore patterns in heatmaps such as in Figure 2. 

One of the novelties of our clustering algorithm is the inclusion of indi- 
vidual specific parameters, (/^ij, cr^j, /i2i, ^Uj, into the model of Section 3, 
which facilitates the use of our algorithm even in the presence of extreme 
microarray effects. Since the amount of data we have to estimate these pa- 
rameters {G observations per subject) highly exceeds the number of sub- 
jects (n), the estimation of these parameters has not been a problem. How- 
ever, it is common practice to treat such individual specific parameters as 
random effects. We have established that with conjugate normal and inverse 
gamma priors on the above parameters we arrive at a marginal likelihood 
intractable for maximization. However, we have verified that through such 
prior specifications we could easily calculate full conditionals in a Bayesian 
analysis. A Bayesian approach would also prevent us from having to as- 
sume the methylation indicators, w, are fixed as in the iterative algorithm 
of Section 4.2. The reason for that assumption was that the joint posterior 
distribution of (w,X) is highly intractable. However, full conditionals for 
each variable separately are easily obtained and are that of Bernoulli and 
Multinomial, respectively. Running a fully Bayesian analysis might also fa- 
cilitate an extended algorithm that could include all variables. One might 
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assume that some variables are informative and follow the mixture in (6) 
with prior probability p, whereas other variables are noninformative with 
prior probability (1 — p) and follow the mixture in (6) with C = [n]. There 
are some challenges that arise in implementing such a fully Bayesian model 
and those require further research. 

Acknowledgments. The methods presented in this paper have been imple- 
mented as an R-package that is available at http://www.stat.cornell.edu/ 
imac/. 

SUPPLEMENTARY MATERIAL 

Simulation and details of EM algorithms (DOT: 10.1214/11-AOAS533SUPP; 
.pdf). We perform a simulation study to assess the performance of our clus- 
tering algorithm in the presence of sparse correlation structure. We also 
derive the steps involved in maximizing the likelihoods of the several models 
presented in this paper. 
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